To better understand the strengths and weaknesses of the BioLexicon
and 
BioVALEX,
% the Cambridge system, 
we evaluated them against the SEM-30 and SEM-26 gold standards
(see 
% performed an evaluation against the gold
% standard described in 
Section~\ref{gold}).

First we evaluated 
BioVALEX
% the Cambridge system 
against SEM-30, using
the two filtering methods described in Section~\ref{subcat_system}. We
used the same evaluation measures as previous SCF evaluations for
general language \citep{korhonen:02,preiss:07}, namely type precision,
type recall, and F-score (the harmonic mean of precision and
recall). We also looked at the number of gold standard SCFs unseen in
the system output; that is, false negatives which were not detected at
all by the classifier (rather than being filtered out). 
%  It is
% not a perfect system for evaluation, since there may be frames in the
% gold standard that don't appear in the input corpus and vice versa,
% but it is the best one avaiable. 
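
To make these measures concrete, one common formulation is the following; the notation, and the pooling of counts over verbs, are assumptions made here for illustration. Let $G_v$ be the set of SCF types listed for verb $v$ in the gold standard and $S_v$ the set of SCF types proposed for $v$ by the filtered lexicon. Then
\[
P = \frac{\sum_v |S_v \cap G_v|}{\sum_v |S_v|}, \qquad
R = \frac{\sum_v |S_v \cap G_v|}{\sum_v |G_v|}, \qquad
F = \frac{2PR}{P+R}.
\]
With $U_v$ denoting the unfiltered set of SCF types hypothesized for $v$, the number of unseen gold standard SCFs is $\sum_v |G_v \setminus U_v|$.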

% We also compared the similarity between the acquired unfiltered
% distributions against the gold standards using various measures of
% distributional similarity: Kullback-Leibler distance (\textsc{kl}),
% Jensen-Shannon divergence (\textsc{js}), cross entropy (\textsc{ce}),
% skew divergence (\textsc{sd}) and intersection (\textsc{is}).  See
% \citep{korhonen:02b} for a description of these measures and their
% application to SCF acquisition. A detailed description of \textsc{js}
% can be found in Section~\ref{subdomain_variation_methods} where it is
% used as a measure of subdomain variation.

The accuracy of 
BioVALEX 
% the Cambridge SCF lexicon 
on SEM-30 is shown in
Table~\ref{t:mainresult}. 
% The first row shows type precision and
% recall using a relative frequency threshold of 0.03. The second row
% shows the results using SCF-specific filtering (see
% Section~\ref{subcat_system}) for a full description of the filtering
% methods). 
With the relative frequency threshold we see an overall
F-score of approximately 45, with recall favored over precision. Using
SCF-specific thresholds we obtain an F-score of nearly 60 with
precision slightly favored over recall. This improvement shows that
knowledge about general language SCFs can be useful for filtering in
biomedicine. The higher F-score of nearly 60 is still about 9 points lower than can be
obtained for general language, e.g.~\citep{preiss:07}. This result is
respectable considering that no adaptations were made to the
acquisition system besides applying it to a large biomedical corpus, but
it also shows that there is scope for adaptation to the biomedical
domain.  Note also that no SCFs are unseen in the unfiltered lexicon,
showing that the system is capable of finding all the SCFs in the gold
standard.
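
As an arithmetic check, the F-scores in Table~\ref{t:mainresult} follow from the reported precision and recall via the harmonic mean (the subscripts are labels introduced here for the two filtering methods):
\[
F_{0.03} = \frac{2 \times 39.37 \times 52.41}{39.37 + 52.41} \approx 44.96, \qquad
F_{\mathrm{SCF}} = \frac{2 \times 60.87 \times 59.04}{60.87 + 59.04} \approx 59.94 .
\]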

\begin{table}[]
\begin{tabular}{| l | r | r | r | r |}
  \hline
  Filtering     &  F-score & Prec & Rec & Unseen\\
  \hline
  \hline
  0.03 thresh & 44.96 & 39.37 & 52.41 & 0\\
  SCF-specific  & 59.94   & 60.87     & 59.04 & 0 \\
  \hline
\end{tabular}
\caption{Accuracy of BioVALEX on the 30-verb semantic gold standard (SEM-30) with each filtering method.}
\label{t:mainresult}
\end{table}

Second, we performed a comparative evaluation of 
BioVALEX
% the Cambridge lexicon
with the BioLexicon. Since the BioLexicon includes only 26 of the 30
verbs in SEM-30 (see Table~\ref{t:verblist}), we used SEM-26, a subset of SEM-30 consisting of those 26 verbs.

The comparative evaluation was not straightforward since 
% This was not straightforward for a number of
% reasons. Note first that the BioLexicon . In addition,
the BioLexicon SCF inventory is quite different from the Cambridge
inventory, and the mapping between the two inventories is
many-to-many. Recall that the BioLexicon inventory is induced from the
parsed corpus, whereas the Cambridge inventory is pre-defined. Each
inventory is more fine-grained in certain areas. For example, the
Cambridge inventory includes multiple frames for various constructions
that are distinguished in linguistics, such as predicate nominals
({\it He seemed a fool}, where {\it fool} is predicated of {\it He})
as opposed to direct object nominals ({\it He saw a fool}). On the
other hand, the BioLexicon inventory differentiates SCFs with PP
arguments according to preposition, so that NP-PP-{\it through},
NP-PP-{\it from}, NP-PP-{\it for}, etc.\ are different frames,
while the Cambridge inventory has only two frames with an NP-PP
configuration.  Since SEM-26 is annotated using the Cambridge
inventory, we had two options: map the BioLexicon inventory to the
Cambridge inventory and evaluate the BioLexicon directly on SEM-26, or
modify SEM-26 to use a common intermediate representation and map both
inventories to this representation for evaluation. We pursued both options.

We call the manual mapping from the BioLexicon inventory to the Cambridge inventory
``best match''. Here we manually examined each SCF in the BioLexicon
inventory and chose the single SCF in the Cambridge inventory to which it was most
likely to correspond. Following the example above, the BioLexicon
frames NP-PP-{\it through}, NP-PP-{\it from}, etc.\ would all map to the Cambridge
frame NP-PP. Similarly, the BioLexicon frame NP-PP-{\it into}-PP-{\it on}
would map to the Cambridge frame NP-PP-PP.  This process
resulted in a set of 22 SCF types for the BioLexicon. This is far fewer
than the 97 SCF types reported by \citet{biolexicon:2008},\footnote{We found 136 SCF types when querying the BioLexicon.}
since we collapse the SCFs that are lexicalized for preposition.
% but that total
% includes the frames lexicalized for preposition, which we could not
% evaluate separately given the information available in our gold
% standard.

% For
% example, our inventory has only a single PP-PP SCF type in which the
% verb takes two PP complements; it does not matter which prepositions
% occur in the complements.  However, BioLexicon has 13 SCF types of
% this same form, distinguished by different combinations of
% prepositions.

After performing the ``best match'' mapping, we evaluated the
BioLexicon and 
BioVALEX
% the Cambridge system 
directly on SEM-26. A simple
relative frequency threshold of 0.03 was used for filtering in both
the BioLexicon and 
BioVALEX.
% the Cambridge system. 
Although we have demonstrated that general language statistics can be
successfully used for SCF-specific filtering in biomedicine, we used a relative
frequency threshold for the comparative evaluation since it provides
a level playing field for the BioLexicon and BioVALEX.
% We chose to use simple
% relaative frequency for two reasons: it is more fair to the BioLexicon;
% due to the nature of the SCF-specific filtering we were unable to
% perform it for BioLexicon. Also after demonstrating that general
% language statistics can be used for filtering, we wanted to rely as
% little as possible on general language data.
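
Stated explicitly, and assuming the usual form of relative frequency filtering (the symbols are introduced here for illustration), let $c(v,s)$ be the number of times SCF $s$ is hypothesized for verb $v$. The SCF $s$ is retained for $v$ when
\[
\frac{c(v,s)}{\sum_{s'} c(v,s')} \ge 0.03 ,
\]
and discarded otherwise.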

We again report type precision, type recall, and F-score. We also
report the number of SCFs missing from the {\it filtered} lexicon.
% (whereas above we reported the number of SCFs unseen in the {\it unfiltered} lexicon for the Cambridge system). 
Note that an SCF may be missing from the filtered lexicon either because it was not in the lexicon at all, or because it had low frequency and was filtered out.\footnote{We do not report SCFs unseen in the unfiltered lexicon here, because we did not have access to the unfiltered BioLexicon.}

The accuracy of BioVALEX
% the Cambridge lexicon 
and the BioLexicon using ``best match'' on SEM-26 is shown
in Table~\ref{t:best-match}.\footnote{Note that the figures for BioVALEX with the 0.03
threshold differ from those in Table~\ref{t:mainresult} because they
are for only 26 verbs.}
% Here we mapped each BioLexicon frame to
% a single SCF in the Cambridge inventory using ``best match'', as described in
% Section~\ref{evaluation_methods}.  
We can see that the BioLexicon has a much higher
F-score even with simple filtering, approaching the F-score achieved
by 
BioVALEX
% the Cambridge system 
with SCF-specific filtering. Interestingly, we can also
see that the BioLexicon strongly favors precision over recall, while 
BioVALEX
% the Cambridge system 
is stronger on recall. The high precision of the BioLexicon results from its use of a deep, lexicalized parser already
adapted to the biomedical domain, including a POS tagger trained on biomedical text. The output of the parsing stage therefore already takes into account some subcategorization information specific to the biomedical domain. The result is a high-precision system for biomedical text that relies on up-front domain adaptation, whereas the Cambridge system is less precise but can be ported to new domains as long as a large corpus of raw data is available.

The lower recall of the BioLexicon likely reflects the
fact that it is based on only a single subdomain of
biomedicine, while 
BioVALEX 
% the Cambridge lexicon 
is built from text across PMC OA. The Cambridge system is thus able to hypothesize SCFs which are likely to be important for interpretation of the text. The trade-off is that it hypothesizes more frames overall, resulting in relatively low precision. This could be overcome in the future, however, with more sophisticated filtering methods, as suggested by the results in Table~\ref{t:mainresult}.

The importance of the higher recall for information extraction can be
seen when we look at the SCFs in SEM-26 which are not included in the BioLexicon. The sentences
in Figure~\ref{f:missing-fine} are examples of frames which appear in 
BioVALEX
% the Cambridge lexicon
but not in the BioLexicon. Note that the BioLexicon may include these frames for other verbs, but at least for the verbs in SEM-26 they were either filtered out or not present to begin with.

% MISSING SCFS 23, 30, 31, 36, 37, 55, 95, 109, 122

\begin{figure}
\begin{tabular}{|p{0.9\columnwidth}|}
\hline
 NP-ING-SC:\\ 
This study indicates that all treatment protocols seemed to be sufficiently effective and safe and that cheyletiellosis in rabbits can be successfully \underline{treated} using ivermectin or selamectin in clinical practice.\\[2pt]
While the AT immunologic activity is normal in this deficiency, plasma AT functional activity is markedly \underline{reduced} leading to risk of thrombosis.\\[5pt]
NP-PP-PP:\\ 
This phenomenon is dose-dependently \underline{inhibited} by leukotriene receptor antagonism with FPL 55712, SK\&F 104353 and montelukast.\\[2pt]
Thus, unlike the Tetrahymena ribozyme, the changes \underline{induced} in precursor RNA by incubation in the absence of divalent cations result in activation of the ribozyme.\\
\hline
\end{tabular}
\caption{Examples of SCFs in SEM-26 and BioVALEX but missing from the BioLexicon.}
\label{f:missing-fine}
\end{figure}



% 36: NP-ING-SC: activate: Proto-oncogenes in retrovirally induced myeloid mouse leukemias are frequently activated following retroviral insertion .
% 36: transcribe: Whole cell RNA was reversely transcribed using an oligo-dT primer .
% 37: ditransitive: which verbs in gs really have?

\begin{table}
\begin{tabular}{| l | r | r | r | r | }
  \hline
  Lexicon             & F-score & Prec & Rec & Missing \\
  \hline
  \hline
  BioVALEX    & 46.20   & 40.00     & 54.68 & 11\\
  BioLexicon          & 58.37   & 87.14     & 43.88 & 20 \\
  \hline
\end{tabular}
\caption{Accuracy of BioVALEX (threshold 0.03) and the BioLexicon, using best-match, on the 26 verbs in the intersection of the semantic gold standard and the BioLexicon.}
\label{t:best-match}
\end{table}

% This time we report the SCFs
% {\it missing from the filtered lexicon}, i.e.~which may not have
% appeared in the lexicon at all or which may have been filtered out,
% rather than the SCFs unseen in the unfiltered lexicon, as in the
% previous evaluation. This is because we had no access to the
% unfiltered BioLexicon.

Because of the many-to-many nature of the mapping, we were concerned
that the ``best match'' mapping might be unfair to the BioLexicon,
since it only captured one SCF in the Cambridge inventory for each SCF
in the BioLexicon, even though there might be more than one legitimate
choice. Therefore, we also pursued another method to handle the
different SCF inventories.

We derived from our gold standard a new version with a much
coarser-grained SCF inventory. We did this by semi-manually creating
equivalence classes of SCFs based on types that both the BioLexicon
and Cambridge inventories could differentiate. First, we expanded the
best match by manually defining a more ``inclusive'' match, listing
all the Cambridge inventory SCFs which could be a match to a
BioLexicon SCF (accounting for the one-to-many aspect of the
BioLexicon-Cambridge inventory mapping). Then we created a bipartite
graph in which one set of nodes represented the Cambridge SCFs and the
other set represented the BioLexicon SCFs. Edges represented
``inclusive'' mapping rules. Each connected component was then considered a
coarse SCF. We call this mapping to coarse-grained SCFs semi-manual,
because the inclusive mapping rules were manually defined, but the
equivalence classes were found automatically.
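
More formally, in notation introduced here only for illustration, let $C$ and $B$ be the sets of Cambridge and BioLexicon SCFs, and let $M(b) \subseteq C$ be the set of Cambridge SCFs listed by the inclusive match for $b \in B$. The graph is
\[
G = (C \cup B,\; E), \qquad E = \{\, \{c, b\} : b \in B,\ c \in M(b) \,\},
\]
and two SCFs belong to the same coarse SCF exactly when a path in $G$ connects them. For instance, if hypothetical inclusive matches were $M(b_1) = \{c_1, c_2\}$ and $M(b_2) = \{c_2\}$, then $\{b_1, b_2, c_1, c_2\}$ would form a single coarse SCF, since $b_1$ and $b_2$ share the neighbor $c_2$.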

% First, to account for the cases where one BioLexicon SCF matched many
% Cambridge SCFs. We manually defined a more ``inclusive'' match that
% listed all the Cambridge SCFs which could be a match to a BioLexicon
% SCF. For example, the best match for BioLexicon NP-TOINF (?) was
% Cambridge NP-TO-INF-OC, e.g. ``I advised Mary to go''. However, this
% could also have been the Cambridge frame NP-TOBE ``I found him to be a
% good doctor'', since we don't know from the BioLexicon frame whether a
% form of ``to be'' or a different verb followed the TO-INF.

% No prepositions were considered.  Some Cambridge SCFs did not correspond to any BioLexicon SCFs, and these were left out of the evaluation.

The resulting coarse-grained inventory contained 14 broad SCFs. We
% then 
evaluated both BioVALEX
% the Cambridge lexicon 
and the BioLexicon against a version of the gold
standard which had been mapped to this coarser inventory. We again
report type precision, type recall, F-score, and missing SCFs.


\begin{table}
\begin{tabular}{| l | r | r | r | r |}
  \hline
  Lexicon        &     F-score & Prec & Rec & Missing\\
  \hline
  \hline
  BioVALEX   & 65.38   & 55.43     & 79.69 & 2\\
  BioLexicon  & 69.23   & 90.00     & 56.25 & 4\\
  \hline
\end{tabular}
\caption{Accuracy of BioVALEX (threshold 0.03) and the BioLexicon using coarse-grained inventory, on the 26 verbs in the intersection of the semantic gold standard and the BioLexicon.}
\label{t:coarse}
\end{table}

The results using the coarse-grained inventory are shown in
Table~\ref{t:coarse}. As expected, both lexicons show higher accuracy
when evaluated using this more forgiving inventory. The general trends found earlier
still hold, however, with the BioLexicon favoring precision while 
BioVALEX
% the Cambridge system 
favors recall.

Note that even on the coarse-grained SCF inventory, the BioLexicon is
missing more SCFs post-filtering than 
BioVALEX.
% the Cambridge lexicon. 
The sentences
in Figure~\ref{f:missing-coarse} show examples of frames that were missing
from the BioLexicon for the verbs in our gold standard, but not from
BioVALEX.

% MISSING: 19/20/21, 104/107/109

\begin{figure}
\begin{tabular}{|p{0.9\columnwidth}|}
\hline
THAT-S:\\ 
Additionally, our image analysis allowed us to \underline{detect} that FTG mice also ventured further into the open arm compared to FNTG controls.\\[2pt]
All the caregivers \underline{expressed} that the feeling of safety for the patient and the caregiver was essential, emphasizing that professional back-up 24 hours a day was important.\\[5pt]
ING:\\ 
All of these stimuli \underline{activated} signaling through the MAP kinase/ERK pathway and led to the induction of P-YB-1S102.\\[2pt]
Although none of the mutations \underline{increased} binding to the same degree as removing the entire USH, they had little effect on the solubility of the protein compared to removal of the entire USH.\\

\hline
\end{tabular}
\caption{Examples of SCFs in the coarse-grained version of SEM-26 and BioVALEX but missing from the BioLexicon.}
\label{f:missing-coarse}
\end{figure}


% Finally, we evaluate the accuracy of our best-performing system, with
% SCF-specific filtering, on the 10 verbs in the semantic and syntactic
% gold standards, using the same evaluation measures (we don't report
% unseen SCFs here, since it would not be meaningful on such a small
% number of verbs).
% To conclude,
%  this section, 
Overall, it can be seen that the accuracy of both BioVALEX and the
BioLexicon against a biomedical gold standard is lower than for
general language SCF acquisition against general language SCF gold standards.
We believe 
the lower accuracy
% these problems 
arises from different sources for the two
lexicons. 
BioVALEX is insufficiently adapted to biomedical text, and hypothesizes a
wide variety of SCFs inappropriate for the domain, resulting in low precision. The BioLexicon, on the other hand, suffers from lower recall, which may mean that
% For the BioLexicon, 
% we can see that 
% it appears that
a system whose components
have been manually adapted 
to a single subdomain does not generalize well enough to the
% overall
variety of
% different
subdomains in PMC OA.
% ; precision is high, but there is a limitation to how far this approach can take us. 
% For BioVALEX,
% the Cambridge system, 
% we can see that it is not well enough adapted; although recall is higher,
% precision leaves something to be desired. 
Domain adaptation for SCF acquisition is clearly needed
% The field of biomedical informatics needs domain adaptation
% for SCF acquisition 
in order to create accurate, scalable SCF lexicons
to help with downstream NLP tasks, but 
% it needs to be a new,
% less-supervised kind of domain adaptation.
a less-supervised approach is required to avoid over-adaptation to a single subdomain.
